Assignment I

Arjun Ghumman

Pitfalls of using Boxplots

Q1.1 a) Examine a number of recent issues of a scientific or statistical journal in which you have some interest. Find one or more examples of a graph or table that is a particularly bad use of display material to summarize and communicate research findings. Write a few sentences indicating how or why the display fails and how it could be improved.

Most of the articles I reviewed used boxplots. Boxplots are not inherently problematic, but there are often better alternatives. A boxplot reduces each group to a handful of summary statistics, so it shows neither the shape of the underlying distribution nor the number of observations in each group. It can be improved by adding jittered points to show the individual observations, or by replacing it with a violin plot, which displays the estimated density of the data for each group. Showing the sample size of each group on the chart also helps to indicate whether any group is under-represented. Even with these additions, the boxplot itself remains a summary of the data, so it is worth considering plots that display the underlying data more directly, such as violin plots or violin plots with jitter.
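The suggestions above can be sketched in a few lines. This is only an illustration on simulated data (the groups A, B and C and their values are made up, not taken from any article), and it assumes the ggplot2 package is installed. Group B is deliberately bimodal and group C deliberately small, two features a plain boxplot would hide.

```r
library(ggplot2)

set.seed(1)  # simulated data, for illustration only
df <- data.frame(
  group = rep(c("A", "B", "C"), times = c(40, 40, 8)),
  value = c(rnorm(40, 10, 2),                        # A: unimodal
            rnorm(20, 7, 0.7), rnorm(20, 13, 0.7),   # B: bimodal
            rnorm(8, 10, 2))                         # C: small group
)

# Axis labels that carry the sample size of each group
n_by_group <- table(df$group)
axis_labels <- setNames(
  paste0(names(n_by_group), "\n(n = ", as.integer(n_by_group), ")"),
  names(n_by_group)
)

p <- ggplot(df, aes(x = group, y = value)) +
  geom_violin(fill = "grey90") +                       # shape of each distribution
  geom_boxplot(width = 0.1, outlier.shape = NA) +      # summary statistics
  geom_jitter(width = 0.08, height = 0, alpha = 0.6) + # individual observations
  scale_x_discrete(labels = axis_labels) +
  labs(x = NULL, y = "Value")
print(p)
```

The boxplot layer is kept narrow so that the summary statistics remain visible without covering the violin or the points.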

Q1.2 As in the previous exercise, examine the recent literature in recent issues of some journal of interest to you. Find one or more examples of a graph or table that you feel does a good job of summarizing and communicating research findings.

The aforementioned image uses a histogram to examine a negative binomial distribution. The negative binomial distribution models counts, classically the number of failures observed before a fixed number of successes, and is often used for overdispersed count data. In this case, the authors do a poor job of visualizing their finding that the data were a poor fit to the negative binomial distribution. To improve the visualization, the authors could have sorted the histogram, used a cumulative distribution function plot or a QQ plot, and included reference lines for the mean and median of the distribution.
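A rough sketch of the suggested diagnostics follows, using base R only. The counts are simulated (they are not the article's data), and the negative binomial parameters are estimated by the method of moments, which is an assumption made here for simplicity rather than the authors' fitting procedure.

```r
set.seed(42)
x <- rnbinom(200, size = 2, mu = 5)   # simulated counts, for illustration only

# Method-of-moments estimates; only meaningful when var(x) > mean(x)
m <- mean(x)
v <- var(x)
size_hat <- m^2 / (v - m)

# Empirical CDF against the fitted negative binomial CDF
k <- 0:max(x)
plot(ecdf(x), main = "Empirical vs fitted CDF",
     xlab = "Count", ylab = "Cumulative probability")
lines(k, pnbinom(k, size = size_hat, mu = m), type = "s", col = "red")
abline(v = c(mean(x), median(x)), lty = c(2, 3))   # mean and median reference lines

# QQ plot: fitted quantiles against the observed counts
probs <- ppoints(length(x))
qqplot(qnbinom(probs, size = size_hat, mu = m), x,
       xlab = "Fitted negative binomial quantiles",
       ylab = "Observed counts")
abline(0, 1)   # points near this line indicate a good fit
```

Systematic departures from the reference line in the QQ plot, or a visible gap between the two CDF curves, would make a poor fit far easier to see than a histogram alone.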

Q1.3 Infographics are another form of visual displays, quite different from the data graphics featured in this book, but often based on some data or analysis. Do a Google image search for the topic “Global warming” to see a rich collection.

The infographic template above uses bar plots to support inferences about global warming. Its general conclusion is that global warming is mainly caused by humans. Scientists were polled: the left plot shows the percentage of scientists who agree that global warming is caused by humans, whereas the right plot shows the percentage who disagree.

For each of the following data sets in the vcdExtra package, identify which are response variable(s) and which are explanatory. For factor variables, which are unordered (nominal) and which should be treated as ordered? Write a sentence or two describing substantive questions of interest for analysis of the data. (Hint: use data(foo, package="vcdExtra") to load, and str(foo), help(foo) to examine data set foo.)

Abortion %>% glimpse()
##  'table' num [1:2, 1:2, 1:2] 171 152 138 167 79 148 112 133
##  - attr(*, "dimnames")=List of 3
##   ..$ Sex             : chr [1:2] "Female" "Male"
##   ..$ Status          : chr [1:2] "Lo" "Hi"
##   ..$ Support_Abortion: chr [1:2] "Yes" "No"

All three variables are dichotomously scored, with Status serving as a potential ordered factor.
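One way to look at this table is to condition on the explanatory variables. The sketch below rebuilds the table from the counts printed above, so it runs without vcdExtra. It assumes that Support_Abortion is the response and that Sex and Status are explanatory, which is a reading of the data rather than something stated in the output.

```r
# Rebuild the 2 x 2 x 2 table from the counts printed above
Abortion <- as.table(array(
  c(171, 152, 138, 167, 79, 148, 112, 133),
  dim = c(2, 2, 2),
  dimnames = list(
    Sex = c("Female", "Male"),
    Status = c("Lo", "Hi"),
    Support_Abortion = c("Yes", "No")
  )
))

# Total number of respondents
total <- sum(Abortion)

# Assumption: Support_Abortion is the response, so condition on Sex and Status.
# Within each Sex x Status combination the Yes/No proportions sum to 1.
support_props <- prop.table(Abortion, margin = c(1, 2))
ftable(round(support_props, 3))
```

Comparing the "Yes" column across rows then addresses a substantive question such as whether support differs by sex, by status, or by their combination.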

Caesar %>% glimpse()
##  'table' num [1:3, 1:2, 1:2, 1:2] 0 1 17 0 1 1 11 17 30 4 ...
##  - attr(*, "dimnames")=List of 4
##   ..$ Infection  : chr [1:3] "Type 1" "Type 2" "None"
##   ..$ Risk       : chr [1:2] "Yes" "No"
##   ..$ Antibiotics: chr [1:2] "Yes" "No"
##   ..$ Planned    : chr [1:2] "Yes" "No"

Three variables (Risk, Antibiotics and Planned) are dichotomously scored and serve as unordered factors. Infection takes the values None, Type 1 and Type 2, and could serve as an ordered factor if the operationalization of the variable runs from low to high.

DaytonSurvey %>% apply(2, unique)
## $cigarette
## [1] "Yes" "No" 
## 
## $alcohol
## [1] "Yes" "No" 
## 
## $marijuana
## [1] "Yes" "No" 
## 
## $sex
## [1] "female" "male"  
## 
## $race
## [1] "white" "other"
## 
## $Freq
##  [1] "405" " 13" "  1" "268" "218" " 17" "117" "453" " 28" "228" "201" "133"
## [13] " 23" "  2" "  0" " 19" " 12" " 30" " 18" "  8"

DaytonSurvey consists of 6 variables: 5 dichotomous unordered factors (cigarette, alcohol, marijuana, sex and race) and a final variable, Freq, which holds the frequency count for each combination of the factors.

Hoyt %>% glimpse()
##  'table' num [1:4, 1:3, 1:7, 1:2] 87 3 17 105 216 4 14 118 256 2 ...
##  - attr(*, "dimnames")=List of 4
##   ..$ Status    : chr [1:4] "College" "School" "Job" "Other"
##   ..$ Rank      : chr [1:3] "Low" "Middle" "High"
##   ..$ Occupation: chr [1:7] "1" "2" "3" "4" ...
##   ..$ Sex       : chr [1:2] "Male" "Female"

The Hoyt dataset consists of 4 variables, with Status, Rank and Occupation serving as potential ordered factors; Sex is dichotomous.

The data set DanishWelfare in vcd gives a 4-way, 3 x 4 x 3 x 5 table as a data frame in frequency form, containing the variable Freq and four factors, Alcohol, Income, Status and Urban. The variable Alcohol can be considered as the response variable, and the others as possible predictors.

DanishWelfare %>% as_tibble()
## # A tibble: 180 × 5
##     Freq Alcohol Income Status  Urban        
##    <dbl> <fct>   <fct>  <fct>   <fct>        
##  1     1 <1      0-50   Widow   Copenhagen   
##  2     4 <1      0-50   Widow   SubCopenhagen
##  3     1 <1      0-50   Widow   LargeCity    
##  4     8 <1      0-50   Widow   City         
##  5     6 <1      0-50   Widow   Country      
##  6    14 <1      0-50   Married Copenhagen   
##  7     8 <1      0-50   Married SubCopenhagen
##  8    41 <1      0-50   Married LargeCity    
##  9   100 <1      0-50   Married City         
## 10   175 <1      0-50   Married Country      
## # … with 170 more rows
DanishWelfare %>% nrow()
## [1] 180
DanishWelfare %>% .$Freq %>% sum
## [1] 5144

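The data frame has 180 rows, one for each combination of the factor levels (3 x 4 x 3 x 5 = 180). Summing Freq shows that these rows represent 5144 cases in total.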
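The distinction between rows and cases can be illustrated on a small toy data set in frequency form. The factors A and B and their counts below are hypothetical and chosen only to keep the example short.

```r
# Toy data in frequency form: one row per combination of factor levels
freq_df <- expand.grid(A = c("a1", "a2"), B = c("b1", "b2", "b3"))
freq_df$Freq <- c(5, 0, 2, 7, 1, 3)

n_rows  <- nrow(freq_df)      # number of factor-level combinations
n_cases <- sum(freq_df$Freq)  # number of observations those rows represent

# Table form: a 2 x 3 contingency table built from the frequencies
tab <- xtabs(Freq ~ A + B, data = freq_df)

# Case form: repeat each row Freq times, giving one row per observation
case_df <- freq_df[rep(seq_len(nrow(freq_df)), freq_df$Freq), c("A", "B")]
```

Here n_rows is 6 while n_cases and nrow(case_df) are both 18, which mirrors the relationship between 180 and 5144 in DanishWelfare.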

DanishWelfare <- DanishWelfare %>% mutate(Alcohol = Alcohol %>% ordered(),
                                          Income = Income %>% ordered())

DanishWelfare %>% str
## 'data.frame':    180 obs. of  5 variables:
##  $ Freq   : num  1 4 1 8 6 14 8 41 100 175 ...
##  $ Alcohol: Ord.factor w/ 3 levels "<1"<"1-2"<">2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Income : Ord.factor w/ 4 levels "0-50"<"50-100"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Status : Factor w/ 3 levels "Widow","Married",..: 1 1 1 1 1 2 2 2 2 2 ...
##  $ Urban  : Factor w/ 5 levels "Copenhagen","SubCopenhagen",..: 1 2 3 4 5 1 2 3 4 5 ...

The code above converts Alcohol and Income to ordered factors; Status and Urban remain unordered.
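Calling ordered() without a levels argument relies on the level order the factor already has, and the str() output above suggests that this order was the intended one here. As a hedged illustration of why that is worth checking, the sketch below applies ordered() to a plain character vector of the same income labels, where the levels are sorted as text instead.

```r
income_chr <- c("0-50", "50-100", "100-150", ">150")

# On a plain character vector, ordered() sorts the labels as text,
# which need not match the numeric order of the income bands.
levels(ordered(income_chr))

# Stating the levels explicitly removes the ambiguity.
income_ord <- ordered(income_chr,
                      levels = c("0-50", "50-100", "100-150", ">150"))
levels(income_ord)

# Comparisons now respect the declared order
lowest_below_highest <- income_ord[1] < income_ord[4]
is.ordered(income_ord)
```

Passing levels explicitly is a small cost that makes the ordering independent of how the data happened to be stored.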

ftable(xtabs(Freq ~Alcohol+Income+Status+Urban, data = DanishWelfare))
##                           Urban Copenhagen SubCopenhagen LargeCity City Country
## Alcohol Income  Status                                                         
## <1      0-50    Widow                    1             4         1    8       6
##                 Married                 14             8        41  100     175
##                 Unmarried                6             1         2    6       9
##         50-100  Widow                    8             2         7   14       5
##                 Married                 42            51        62  234     255
##                 Unmarried                7             5         9   20      27
##         100-150 Widow                    2             3         1    5       2
##                 Married                 21            30        23   87      77
##                 Unmarried                3             2         1   12       4
##         >150    Widow                   42            29        17   95      46
##                 Married                 24            30        50  167     232
##                 Unmarried               33            24        15   64      68
## 1-2     0-50    Widow                    3             0         1    4       2
##                 Married                 15             7        15   25      48
##                 Unmarried                2             3         9    9       7
##         50-100  Widow                    1             1         3    8       4
##                 Married                 39            59        68  172     143
##                 Unmarried               12             3        11   20      23
##         100-150 Widow                    5             4         1    9       4
##                 Married                 32            68        43  128      86
##                 Unmarried                6            10         5   21      15
##         >150    Widow                   26            34        14   48      24
##                 Married                 43            76        70  198     136
##                 Unmarried               36            23        48   89      64
## >2      0-50    Widow                    2             0         2    1       0
##                 Married                  1             2         2    7       7
##                 Unmarried                3             0         1    5       1
##         50-100  Widow                    3             0         2    1       3
##                 Married                 14            21        14   38      35
##                 Unmarried                2             0         3   12      13
##         100-150 Widow                    2             1         1    1       0
##                 Married                 20            31        10   36      21
##                 Unmarried                0             2         3    9       7
##         >150    Widow                   21            13         5   20       8
##                 Married                 23            47        21   53      36
##                 Unmarried               38            20        13   39      26
DanishWelfare %>% group_by(Urban) %>% dplyr::summarise(Total = sum(Freq))
## # A tibble: 5 × 2
##   Urban         Total
##   <fct>         <dbl>
## 1 Copenhagen      552
## 2 SubCopenhagen   614
## 3 LargeCity       594
## 4 City           1765
## 5 Country        1619
DanishWelfare <- DanishWelfare %>%
  mutate(Urban = fct_collapse(Urban,
                              City = levels(Urban)[1:4],
                              NonCity = levels(Urban)[5]))

str(DanishWelfare)
## 'data.frame':    180 obs. of  5 variables:
##  $ Freq   : num  1 4 1 8 6 14 8 41 100 175 ...
##  $ Alcohol: Ord.factor w/ 3 levels "<1"<"1-2"<">2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Income : Ord.factor w/ 4 levels "0-50"<"50-100"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Status : Factor w/ 3 levels "Widow","Married",..: 1 1 1 1 1 2 2 2 2 2 ...
##  $ Urban  : Factor w/ 2 levels "City","NonCity": 1 1 1 1 2 1 1 1 1 2 ...
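A quick sanity check on the collapse is to confirm that the grand total is unchanged. The sketch below uses the per-level totals printed earlier and the same mapping as the fct_collapse() call (first four levels to City, the fifth to NonCity). It works on a plain named vector, so it does not need the data set or forcats.

```r
# Urban totals as printed by the group_by()/summarise() step
urban_totals <- c(Copenhagen = 552, SubCopenhagen = 614, LargeCity = 594,
                  City = 1765, Country = 1619)

# Same mapping as the fct_collapse() call: levels 1-4 -> City, level 5 -> NonCity
collapsed_level <- c(rep("City", 4), "NonCity")

# Sum the original totals within each collapsed level
collapsed_totals <- tapply(urban_totals, collapsed_level, sum)
collapsed_totals   # City = 3525, NonCity = 1619

# The grand total must still equal the 5144 cases found earlier
sum(collapsed_totals)
```

The same check could be run on the data frame itself by repeating the group_by(Urban) summary after the collapse.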